 angular error


Adaptive Plane Reformatting for 4D Flow MRI using Deep Reinforcement Learning

Bisbal, Javier, Sotelo, Julio, Valdés, Maria I, Irarrazaval, Pablo, Andia, Marcelo E, García, Julio, Rodriguez-Palomarez, José, Raimondi, Francesca, Tejos, Cristián, Uribe, Sergio

arXiv.org Artificial Intelligence

Background and Objective: Plane reformatting for four-dimensional phase contrast MRI (4D flow MRI) is time-consuming and prone to inter-observer variability, which limits fast cardiovascular flow assessment. Deep reinforcement learning (DRL) trains agents to iteratively adjust plane position and orientation, enabling accurate plane reformatting without the need for detailed landmarks, making it suitable for images with limited contrast and resolution such as 4D flow MRI. However, current DRL methods assume that test volumes share the same spatial alignment as the training data, limiting generalization across scanners and institutions. To address this limitation, we introduce AdaPR (Adaptive Plane Reformatting), a DRL framework that uses a local coordinate system to navigate volumes with arbitrary positions and orientations. Methods: We implemented AdaPR using the Asynchronous Advantage Actor-Critic (A3C) algorithm and validated it on 88 4D flow MRI datasets acquired from multiple vendors, including patients with congenital heart disease. Results: AdaPR achieved a mean angular error of 6.32 +/- 4.15 degrees and a distance error of 3.40 +/- 2.75 mm, outperforming global-coordinate DRL methods and alternative non-DRL methods. AdaPR maintained consistent accuracy under different volume orientations and positions. Flow measurements from AdaPR planes showed no significant differences compared to two manual observers, with excellent correlation (R^2 = 0.972 and R^2 = 0.968), comparable to inter-observer agreement (R^2 = 0.969). Conclusion: AdaPR provides robust, orientation-independent plane reformatting for 4D flow MRI, achieving flow quantification comparable to expert observers. Its adaptability across datasets and scanners makes it a promising candidate for medical imaging applications beyond 4D flow MRI.
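
The two error metrics reported in this abstract can be illustrated with a minimal sketch. It assumes that a plane is described by a normal vector and a center point in millimeters, that angular error is the angle between the predicted and reference normals, and that distance error is the Euclidean distance between the two centers. The abstract does not define the metrics, so these definitions and the function names are hypothetical.

```python
import math

def angular_error_deg(n_pred, n_ref):
    """Angle in degrees between two plane normals (assumed metric definition)."""
    dot = sum(a * b for a, b in zip(n_pred, n_ref))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_ref))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

def distance_error_mm(c_pred, c_ref):
    """Euclidean distance between two plane centers given in millimeters."""
    return math.dist(c_pred, c_ref)

# Hypothetical example values, not taken from the paper.
angle = angular_error_deg((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
offset = distance_error_mm((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

Under this definition a flipped normal counts as a 180 degree error; an evaluation that ignores plane direction would take the absolute value of the dot product instead.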


Clustering Guided Residual Neural Networks for Multi-Tx Localization in Molecular Communications

Sonmez, Ali, Ozbey, Erencem, Mantaroglu, Efe Feyzi, Yilmaz, H. Birkan

arXiv.org Artificial Intelligence

Transmitter localization in Molecular Communication via Diffusion is a critical topic with many applications. However, accurate localization of multiple transmitters is a challenging problem due to the stochastic nature of diffusion and overlapping molecule distributions at the receiver surface. To address these issues, we introduce clustering-based centroid correction methods that enhance robustness against density variations and outliers. In addition, we propose two clustering-guided Residual Neural Networks, namely AngleNN for direction refinement and SizeNN for cluster size estimation. Experimental results show that both approaches provide significant improvements, reducing localization error by between 69% (2-Tx) and 43% (4-Tx) compared to K-means.
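
For context, the K-means baseline named in this abstract can be sketched as plain Lloyd iterations over the positions where molecules hit the receiver surface, with each resulting centroid serving as a rough per-transmitter estimate. This is a generic illustration, not the authors' pipeline: the function name, the fixed initial centroids, and the sample points are all hypothetical.

```python
import math

def kmeans_centroids(points, init_centroids, iters=20):
    """Plain Lloyd's algorithm; returns the final centroids as tuples."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids

# Hypothetical hit positions from two transmitters (2-Tx case).
hits = [(0.0, 0.0, 1.0), (0.0, 0.1, 1.0), (1.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
centers = kmeans_centroids(hits, init_centroids=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
```

Because the centroid is a plain mean, uneven hit densities and outliers pull it away from the true direction, which is the weakness the abstract's centroid correction methods are said to address.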


SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning

Adebayo, Samuel, Dessing, Joost C., McLoone, Seán

arXiv.org Artificial Intelligence

In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariate shifts, and test domain generalization. SLYKLatent utilizes Self-Supervised Learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained variance-weighted training loss function. Our evaluation on benchmark datasets achieves a 10.9% improvement on Gaze360, surpasses the top MPIIFaceGaze results by 3.8%, and leads on a subset of ETH-XGaze by 11.6%, surpassing existing methods by significant margins. Adaptability tests on RAF-DB and AffectNet show 86.4% and 60.9% accuracies, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components.


LuxDiT: Lighting Estimation with Video Diffusion Transformer

Liang, Ruofan, He, Kai, Gojcic, Zan, Gilitschenski, Igor, Fidler, Sanja, Vijaykumar, Nandita, Wang, Zian

arXiv.org Artificial Intelligence

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.


Probabilistic Task Parameterization of Tool-Tissue Interaction via Sparse Landmarks Tracking in Robotic Surgery

Wang, Yiting, Fan, Yunxin, Liu, Fei

arXiv.org Artificial Intelligence

Accurate modeling of tool-tissue interactions in robotic surgery requires precise tracking of deformable tissues and integration of surgical domain knowledge. Traditional methods rely on labor-intensive annotations or rigid assumptions, limiting flexibility. We propose a framework combining sparse keypoint tracking and probabilistic modeling that propagates expert-annotated landmarks across endoscopic frames, even with large tissue deformations. Clustered tissue keypoints enable dynamic local transformation construction via PCA, and tool poses, tracked similarly, are expressed relative to these frames. Embedding these into a Task-Parameterized Gaussian Mixture Model (TP-GMM) integrates data-driven observations with labeled clinical expertise, effectively predicting relative tool-tissue poses and enhancing visual understanding of robotic surgical motions directly from video data.


AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Xie, Liuyue, Guo, Jiancong, Cakmakci, Ozan, Araujo, Andre, Jeni, Laszlo A., Jia, Zhiheng

arXiv.org Artificial Intelligence

Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ~8.2 degrees and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.


SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting

Yang, Linqi, Zhao, Xiongwei, Sun, Qihao, Wang, Ke, Chen, Ao, Kang, Peng

arXiv.org Artificial Intelligence

6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.


ViIK: Flow-based Vision Inverse Kinematics Solver with Fusing Collision Checking

Meng, Qinglong, Xia, Chongkun, Wang, Xueqian

arXiv.org Artificial Intelligence

Inverse kinematics (IK) finds robot configurations that satisfy a target pose of the end effector. In motion planning, diverse configurations are required in case a feasible trajectory is not found. Meanwhile, collision checking (CC), e.g. oriented bounding box (OBB), discrete oriented polytope (DOP), and Quickhull, needs to be done for each configuration provided by the IK solver to ensure that every goal configuration for motion planning is available. This means the classical IK solver and CC algorithm must be executed repeatedly for every configuration. Thus, the preparation time is long when the required number of goal configurations is large, e.g. motion planning in cluster environments. Moreover, classical collision-checking algorithms require structured maps, which might be difficult to obtain. To sidestep these two issues, we propose a flow-based vision method, named Vision Inverse Kinematics solver (ViIK), that outputs diverse available configurations by fusing inverse kinematics and collision checking. ViIK uses RGB images as its perception of the environment. ViIK can output 1000 configurations within 40 ms, with an accuracy of about 3 millimeters and 1.5 degrees. Higher accuracy can be obtained by refining the output with a classical IK solver for a few iterations. The self-collision rate can be lower than 2%, and the collision-with-environment rate can be lower than 10% in most scenes. The code is available at: https://github.com/AdamQLMeng/ViIK.


DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses

Zhao, Chen, Zhang, Tong, Dang, Zheng, Salzmann, Mathieu

arXiv.org Artificial Intelligence

Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses, which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end, we map the two input RGB images, reference and query, to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module, where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse datasets, demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet/.


Neutrino Reconstruction in TRIDENT Based on Graph Neural Network

Mo, Cen, Zhang, Fuyudi, Li, Liang

arXiv.org Artificial Intelligence

The TRopIcal DEep-sea Neutrino Telescope (TRIDENT) is a next-generation neutrino telescope to be located in the South China Sea. With a large detector volume and the use of advanced hybrid digital optical modules (hDOMs), TRIDENT aims to discover multiple astrophysical neutrino sources and probe all-flavor neutrino physics. The reconstruction resolution of primary neutrinos is on the critical path to these scientific goals. We have developed a novel reconstruction method for TRIDENT based on a graph neural network (GNN). In this paper, we present the reconstruction performance of the GNN-based approach on both track- and shower-like neutrino events in TRIDENT.